CUCM Database replication, how it works
CUCM Database replication, how it works, or how it doesn’t. And when it doesn’t, the shit has hit the fan. I am really hoping that this particular post will expand and improve, because there is very little information available about the internals of the various UC platforms. That is probably Cisco’s way of trying to hide the fact that there is not much Cisco technology behind those platforms, but let me not go there. So if any of you have links, snippets or any info that could improve this post, inbox me!
CUCM uses Informix as its database, and there is no native access to it other than through AXL queries and queries via the CLI. (Unless you are a Cisco TAC engineer, in which case a locally generated hash value is used to generate temporary root access.)
So replication in CUCM 6.x and higher is no longer hub-and-spoke. In the hub-and-spoke topology, an outage on the Publisher pretty much locked down the database.
Figure 1 – Database topology
The current database topology is full mesh: every node connects to every other node. It also means that if the publisher goes down, changes to user-facing features (such as CFwdAll, MWI and Hunt Group log out) can still be made.
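To make the difference between the two topologies concrete, here is a small sketch that just counts the logical connections in each. The node names are made up for illustration, and this only models the topology, not anything CUCM actually runs.

```python
from itertools import combinations


def hub_spoke_links(publisher, subscribers):
    """Old model: every subscriber has a single link, to the publisher."""
    return [(publisher, sub) for sub in subscribers]


def full_mesh_links(nodes):
    """Current model: one logical connection per pair of nodes."""
    return list(combinations(nodes, 2))


# Hypothetical four-node cluster.
nodes = ["PUB", "SUB1", "SUB2", "SUB3"]
print(len(hub_spoke_links(nodes[0], nodes[1:])))  # 3
print(len(full_mesh_links(nodes)))                # 6
```

The mesh grows as n(n-1)/2, which is why a connectivity problem between two subscribers can now break replication even when both can reach the publisher.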
Back to replication and, more importantly, db replication issues. The database engine is Informix Dynamic Server (IDS), which uses Informix Enterprise Replication to replicate between servers.
If db replication breaks, there are many symptoms. For example, phones registered to call manager group A are unable to call phones in call manager group B. Or, after a user logs into extension mobility on Subscriber A while DB replication is bad, another phone registered on Subscriber B might not be able to complete a call to the newly logged-in user. To cut a long story short, your users will be screaming at this stage, you’ll have managers at your desk begging for updates, you know what I’m talking about.
First port of call would be to verify the current replication status. There are various ways of doing this:
- Run utils dbreplication runtimestate from the server’s CLI (start with the PUB)
- RTMT, Call manager>Service>Database Summary
- In Cisco Unified Reporting>Unified CM Database Status.
For example, running the command from the CLI will produce the following output:
admin:utils dbreplication runtimestate
DB and Replication Services: ALL RUNNING
Cluster Replication State: Only available on the PUB
DB Version: ccm8_6_2
Number of replicated tables: 541
Cluster Detailed View from SUB (2Servers):
PING REPLICATION REPL. DBver& REPL. REPLICATION SETUP
SERVER-NAME IP ADDRESS (msec) RPC? STATUS QUEUE TABLES LOOP? (RTMT)
----------- ------------ ------ ---- ----------- ----- ------- ----- -----------------
CUCM1 10.1.1.1 0.357 Yes Connected 0 match Yes (2)
CUCM1 10.1.1.2 0.030 Yes Connected 0 match Yes (2)
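If you save that output to a text file, a few lines of Python can pull the per-node rows out of it. This is only a sketch: the regular expression is written against the exact layout shown above, and other CUCM versions print slightly different columns, so treat the pattern and the field names as assumptions.

```python
import re

# Matches a detail row as laid out in the sample above (an assumption,
# other versions may add or rename columns).
ROW = re.compile(
    r"^(?P<name>\S+)\s+"
    r"(?P<ip>\d{1,3}(?:\.\d{1,3}){3})\s+"
    r"(?P<ping_ms>\d+(?:\.\d+)?)\s+"
    r"(?P<rpc>\S+)\s+"
    r"(?P<status>\S+)\s+"
    r"(?P<queue>\d+)\s+"
    r"(?P<tables>\S+)\s+"
    r"(?P<loop>\S+)\s+"
    r"\((?P<state>\d)\)"
)


def parse_runtimestate(text):
    """Return one dict per node row; header and summary lines are skipped."""
    rows = []
    for line in text.splitlines():
        match = ROW.match(line.strip())
        if match:
            row = match.groupdict()
            row["ping_ms"] = float(row["ping_ms"])
            row["queue"] = int(row["queue"])
            row["state"] = int(row["state"])
            rows.append(row)
    return rows


def unhealthy(rows):
    """Rows whose state is not 2 or whose status is not Connected."""
    return [r for r in rows if r["state"] != 2 or r["status"] != "Connected"]
```

Calling `unhealthy(parse_runtimestate(text))` on the sample above returns an empty list, since both rows are Connected with state (2).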
Any of these methods should produce the replication status of each logical connection, with one of the values summarised in the table below:
Value | Meaning | Description |
---|---|---|
0 | Initialization State | This state indicates that replication is in the process of trying to set up. Being in this state for a period longer than an hour could indicate a failure in setup. |
1 | Number of Replicates not correct | This state is rarely seen in 6.x and 7.x, but in 5.x it can indicate that replication is still in the setup process. Being in this state for a period longer than an hour could indicate a failure in setup. |
2 | Replication is good | Logical connections have been established and tables match the other servers on the cluster. |
3 | Tables are suspect | Logical connections have been established but we are unsure if tables match. In 6.x and 7.x all servers could show state 3 if one server is down in the cluster. This can happen because the other servers are unsure if there is an update to a user facing feature that has not been passed from that sub to the other device in the cluster. |
4 | Setup Failed / Dropped | The server no longer has an active logical connection to receive database tables across. No replication is occurring in this state. |
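The table translates fairly directly into code if you want to script a health check. The sketch below encodes the five values and the one-hour grace period that the table mentions for states 0 and 1. Treating states 3 and 4 as needing attention straight away is my own simplification, not something Cisco prescribes.

```python
from enum import IntEnum


class ReplState(IntEnum):
    INITIALIZING = 0
    WRONG_REPLICATE_COUNT = 1
    GOOD = 2
    TABLES_SUSPECT = 3
    SETUP_FAILED = 4


# "Longer than an hour" from the table above.
SETUP_GRACE_SECONDS = 3600


def needs_attention(state, seconds_in_state):
    """Decide whether a node's replication state warrants troubleshooting."""
    state = ReplState(state)
    if state is ReplState.GOOD:
        return False
    if state in (ReplState.INITIALIZING, ReplState.WRONG_REPLICATE_COUNT):
        # Setup states are only suspicious once they outlast the grace period.
        return seconds_in_state > SETUP_GRACE_SECONDS
    # Assumption: states 3 and 4 are flagged immediately.
    return True
```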
If you see any value other than 2 for a longer period of time, you have an issue of some sort, and further diagnostics are required. The most common causes of failed replication are:
- Connectivity issues between nodes (Most common)
- Host file mismatch
- Communication on UDP port 8500, not in phase 2
- DNS not configured properly (forward/reverse lookup)
- NTP not reachable/time drift between servers
- A Cisco DB replicator service not running/not working
- Cisco Database Layer monitor (DbMon) hung/stopped
The times that I have come across replication issues were after DNS changes, IP changes, and SDL links going out of service, causing replication timeouts.
Further troubleshooting:
The following 5 steps are basic points of verification that should be carried out if any replication issue is encountered.
1-Check server/Cluster connectivity
From version 6.x, a full mesh topology exists between servers, therefore replication needs to exist between every node in the cluster. For intracluster TCP port requirements:
http://www.cisco.com/en/US/docs/voice_ip_comm/cucm/port/8_5_1/portlist851.html
Please note that just running the utils dbreplication runtimestate command and checking the PING column is NOT enough to verify connectivity. You really need to check connectivity on a per-port basis if you have layer 3 boundaries within your cluster that might be filtering (use packet traces on your servers). Checking ICMP only proves that routing works, and potentially nothing more than that. I have done a CUC deployment in the past that was giving me all sorts of replication issues, until I found out that the customer was using a firewall between the two servers, just saying.
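As a rough illustration of a port-level check, the sketch below attempts a TCP connection to each port instead of relying on ping. You cannot run Python on the CUCM CLI, so this would have to run from a machine sitting in the same network segment as one of the nodes, which is an approximation of the real node-to-node path at best. The port numbers in the usage comment are placeholders; take the real list from the Cisco port usage document linked above.

```python
import socket


def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be established."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def check_ports(host, ports, timeout=3.0):
    """Map each port to True (reachable) or False (blocked or closed)."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}


# Example usage (hypothetical address and ports, verify against the port list):
# print(check_ports("10.1.1.2", [1500, 1501, 1515]))
```

A port that shows False while ICMP works is exactly the firewall scenario described above. Packet traces on the servers themselves remain the more reliable proof.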
2-Check configuration files
This means checking the /etc/hosts, .rhosts and sqlhosts files through Cisco Unified Reporting.
File | Purpose |
---|---|
/etc/hosts | This file is used to locally resolve hostnames to IP addresses. It should include the hostname and IP address of all nodes in the cluster including CUPS nodes. |
/home/informix/.rhosts | A list of hostnames which are trusted to make database connections |
$INFORMIXDIR/etc/sqlhosts | Full list of CCM servers for replication. Servers here should have the correct hostname and node id (populated from the process node table). This is used to determine to which servers replicates are pushed. |
If you have replication issues after a host name change, or after re-IPing a server, it is very likely that the issue is with one of these files not having been updated. That is why it is so bloody important to follow the installation instructions exactly. The only way to manually change any of these 3 files is through root access to the box, and guess what? Exactly: TAC.
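Since you can only read these files through reporting, a practical approach is to copy the /etc/hosts content shown for each node and compare it against what you expect the cluster to look like. The sketch below does that comparison. The host names and addresses in the usage example are invented, and the function only understands the plain "IP name [alias...]" hosts format.

```python
def parse_hosts(text):
    """Return {hostname: ip} from /etc/hosts-style text."""
    entries = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blanks
        if not line:
            continue
        ip, *names = line.split()
        for name in names:
            entries[name.lower()] = ip
    return entries


def hosts_mismatches(text, expected):
    """Compare hosts-file text against an expected {hostname: ip} mapping."""
    actual = parse_hosts(text)
    problems = []
    for name, ip in expected.items():
        found = actual.get(name.lower())
        if found is None:
            problems.append(f"{name}: missing")
        elif found != ip:
            problems.append(f"{name}: has {found}, expected {ip}")
    return problems


# Hypothetical content copied from one node's report.
report = """
127.0.0.1 localhost
10.1.1.1 cucm1.example.com cucm1
10.1.1.9 cucm2.example.com cucm2   # stale entry after a re-IP
"""
print(hosts_mismatches(report, {"cucm1": "10.1.1.1", "cucm2": "10.1.1.2"}))
```

Run it once per node. A stale entry on a single node is enough to break that node's logical connections.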
3-Verify DNS (optional)
If DNS is configured on a particular server, both forward and reverse DNS lookups must resolve correctly. Informix uses DNS frequently, and DNS failures or improper DNS config can cause issues for replication.
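A quick way to test both directions is a round trip: resolve the name to an address, resolve the address back to a name, and compare. The sketch below uses the standard library resolver of whatever machine it runs on, so it only tells you something useful if that machine uses the same DNS servers as the cluster. It also assumes you pass in the fully qualified name that the PTR record is expected to return.

```python
import socket


def dns_round_trip(hostname,
                   forward=socket.gethostbyname,
                   reverse=socket.gethostbyaddr):
    """Return (ok, detail) for a forward then reverse lookup of hostname."""
    try:
        ip = forward(hostname)
        ptr_name = reverse(ip)[0]  # gethostbyaddr returns (name, aliases, ips)
    except OSError as exc:
        # socket.gaierror and socket.herror are both OSError subclasses.
        return False, f"lookup failed: {exc}"
    detail = f"{hostname} -> {ip} -> {ptr_name}"
    same = ptr_name.rstrip(".").lower() == hostname.rstrip(".").lower()
    return same, detail
```

The `forward` and `reverse` parameters exist only so the lookups can be swapped out, for example to test the logic without touching a real DNS server.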
4-Verify NTP reachability
Verify connectivity from ANY server to your NTP server(s).
5-Check crucial services
Make sure that A Cisco DB, A Cisco DB Replicator and Cisco Database Layer Monitor are running
(utils service list page).
Repair
OK, so at this stage we have verified all our config files, NTP, DNS etc., and we know from looking at the utils dbreplication runtimestate output that some of the logical connections are not connected, i.e. replication has gone cactus.
command | function |
---|---|
utils dbreplication stop | Normally run on a subscriber. Stops replication that is currently running, restarts A Cisco DB Replicator, and deletes the marker file used to signal replication to begin. This can be run on each node of the cluster by doing “utils dbreplication stop”. In 7.1.2 and later, “utils dbreplication stop all” can be run on the Publisher node to stop replication on all nodes. |
utils dbreplication repair, utils dbreplication repairtable, utils dbreplication repairreplicate | Introduced in 7.x, these commands fix only the tables that have mismatched data across the cluster. This mismatched data is found by issuing a utils dbreplication status. These commands should only be used if logical connections have been established between the nodes. |
utils dbreplication reset | Always run from the publisher node, used to reset replication connections and do a broadcast of all the tables. This can be executed to one node by hostname “utils dbreplication reset nodename” or on all nodes by “utils dbreplication reset all”. |
utils dbreplication dropadmindb | Run on a publisher or subscriber, this command is used to drop the syscdr database. This clears out configuration information from the syscdr database which forces the replicator to reread the configuration files. Later examples talk about identifying a corrupt syscdr database. |
utils dbreplication clusterreset | Always run from the publisher node, used to reset replication connections and do a broadcast of all tables. Error checking is ignored. Following this command, ‘utils dbreplication reset all’ should be run in order to get correct status information. Finally, after the cluster has returned to state 2, all subs in the cluster must be rebooted. |
utils dbreplication runtimestate | Available in 7.X and later this command shows the state of replication as well as the state of replication repairs and resets. This can be used as a helpful tool to see which tables are replicating in the process. |
utils dbreplication forcedatasyncsub | This command forces a subscriber to have its data restored from data on the publisher. Use this command only after the ‘utils dbreplication repair’ command has been run several times and the ‘utils dbreplication status’ output still shows non-dynamic tables out of sync. |
Especially useful is the utils dbreplication reset command (actually, it is a script), which basically forces replication to be rebuilt and started in the same way as is done during installation.
1-Reestablish logical connections for replication to ALL servers
If multiple servers are in a state other than 2, as shown either by the utils dbreplication runtimestate command or by the cdr list serv output in reporting, the whole cluster will most likely have issues with replication. In which case:
- use utils dbreplication stop on each subscriber (it can take anywhere up to 300 seconds for this command to complete, depending on the repltimeout value; issue show tech repltimeout to verify what the timeout is set to).
- then use utils dbreplication stop on the pub.
- use utils dbreplication reset all from the pub to trigger new tables to be broadcast over all logical connections. This command will take a while to complete (Cisco mentions 1 hr). Monitor progress by means of the utils dbreplication runtimestate command.
2-Reestablish logical connections for replication to single node
Use the same commands as in scenario 1 above, but stop replication on only the relevant node, not on any of the other nodes OR the pub. After that, use the utils dbreplication reset <nodename> command from the pub. Monitor progress using the utils dbreplication runtimestate command.
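The order of operations in scenarios 1 and 2 is easy to get wrong at 3 AM, so here is a sketch that just spells it out as an ordered list of (node, command) pairs. It does not connect to anything or run the commands; the function name and node names are made up, and the commands are the ones described above.

```python
def reset_plan(publisher, subscribers, failed=None):
    """Return the ordered (node, command) steps for a replication reset.

    With failed=None the whole cluster is reset (scenario 1);
    otherwise only the named subscriber is reset (scenario 2).
    """
    if failed is None:
        # Stop every subscriber first, the publisher last.
        steps = [(sub, "utils dbreplication stop") for sub in subscribers]
        steps.append((publisher, "utils dbreplication stop"))
        steps.append((publisher, "utils dbreplication reset all"))
    else:
        if failed not in subscribers:
            raise ValueError(f"{failed} is not a known subscriber")
        # Stop only the failing node, then reset it from the publisher.
        steps = [
            (failed, "utils dbreplication stop"),
            (publisher, f"utils dbreplication reset {failed}"),
        ]
    # Either way, progress is monitored from the publisher.
    steps.append((publisher, "utils dbreplication runtimestate"))
    return steps


for node, command in reset_plan("PUB", ["SUB1", "SUB2"]):
    print(f"{node}: {command}")
```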
3-All logical connections are connected, but tables are out of sync.
Let’s consider this particular scenario:
- All nodes in the cluster are in Replication State = 3
- The Replication Server List report shows all servers as connected (each server shows local for itself, but connected from the other servers’ perspective).
In this particular case all logical connections are working fine, and there is therefore no need to stop db replication and reset it, as described in scenarios 1 and 2.
- utils dbreplication status from publisher. This command will confirm that there are indeed mismatched tables.
- utils dbreplication repair from publisher – Which repair command to use depends on which tables and how many tables are out of sync. If you have a large cluster or a very large deployment, a repair can take a long time (a day in some instances, depending on how it failed), so it is important to make your commands as granular as possible in those circumstances. On smaller deployments (less than 5,000 phones, for example), a utils dbreplication repair all is normally fine to do. If it is only one node with a bad table, we can do utils dbreplication repair nodename. If it is only one table, you can use utils dbreplication repairtable (node/all) to fix it on the problem server or on the whole cluster.
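The "as granular as possible" rule can be written down as a small decision function. The sketch below takes the mismatches you read out of the utils dbreplication status report, as a mapping of node to out-of-sync tables, and picks the narrowest repair command. Note the assumptions: the mapping is something you build by hand from the report, the table names in the usage example are only examples, and the argument order I use for repairtable (table name first, then node or all) should be checked against the CLI help on your version.

```python
def pick_repair_command(mismatches):
    """Pick the narrowest repair command for {node: set_of_tables}.

    Returns None when nothing is out of sync.
    """
    affected = {node: set(tables) for node, tables in mismatches.items() if tables}
    if not affected:
        return None
    tables = set().union(*affected.values())
    nodes = sorted(affected)
    if len(tables) == 1:
        # One table only: repair just that table, on one node or everywhere.
        table = next(iter(tables))
        target = nodes[0] if len(nodes) == 1 else "all"
        return f"utils dbreplication repairtable {table} {target}"
    if len(nodes) == 1:
        # Several tables, but only one node is affected.
        return f"utils dbreplication repair {nodes[0]}"
    return "utils dbreplication repair all"


print(pick_repair_command({"SUB1": {"device"}, "SUB2": set()}))
```

On a large cluster, the difference between the first and the last branch can be the difference between minutes and a day.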
4-Corrupt syscdr
Issue a utils dbreplication from your failing server and check your output file for mismatching tables (file list activelog <filename>).
- utils dbreplication stop – Issued on the failing subscriber (failSUB).
- utils dbreplication dropadmindb – Issued on failSUB to restart the syscdr on that node.
- utils dbreplication reset <failSUB> – Issued from PUB, to try to reestablish the connection after fixing the corrupted syscdr.
Well, that wraps it up. If anyone knows of any good in-depth sources that discuss CUCM DB replication and troubleshooting, please let me know.
sources:
https://supportforums.cisco.com/videos/4894
https://supportforums.cisco.com/docs/DOC-13672
snippets:
How to view the syscdr file:
admin:file list activelog /cm/trace/dbl
admin:file view activelog /cm/trace/dbl/<date_time>_dbl_repl_cdr_Broadcast.log
How to view the creation of the syscdr file on pub:
file view activelog cm/log/informix/ccm.log